昨天講完了如何處理timestamp()的時間格式該如何處理,今天就讓我們來看看,最大宗的部分,字串!
講到字串,最常使用的就是大名鼎鼎的split(),substr()
顧名思義,也就是當我們需要針對一個欄位進行切分時,最常使用的
那下面我們來看看如何使用吧!
split()
split(col,pattern,limit)
col
: 就是你需要切分的Column Namepattern
: 你要用哪些條件進行切分limit
: optionallimit > 0
: 組數不會超過limitlimit <= 0
: 組數無限制
情境說明:通常會使用在,一個單一欄位中隱含多個不同的資訊,但User很歡,就是叫你要好好切開來啊
的時候適合使用XD
rdd = sc.parallelize(
[
("drink,dance", 2, "Carmen",23,'Female'),
("movie,music", 2, "Juliette",16,'Female'),
("write,learn", 2, "Don José",25,'Male'),
("sleep,hangout", 2, "Escamillo",30,'Male'),
("play,dance", 2, "Roméo",18,'Male')
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
## 我全都要,並且顯示的data type 是一個array
df.select(split(df['Thing'],',').alias('sperate_split')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
| Thing|Hour| Name|Age|Gender|
+-------+----+---------+---+------+
| drink| 2| Carmen| 23|Female|
| movie| 2| Juliette| 16|Female|
|writing| 2| Don José| 25| Male|
| sleep| 2|Escamillo| 30| Male|
| play| 2| Roméo| 18| Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(split(df['Thing'],',').alias('sperate_split')).show()
root
|-- sperate_split: array (nullable = true)
| |-- element: string (containsNull = true)
+----------------+
| sperate_split|
+----------------+
| [drink, dance]|
| [movie, music]|
| [write, learn]|
|[sleep, hangout]|
| [play, dance]|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
split().getItem()
但是我今天如果只想要切分後的某一個特定的位置, 以上述的例子來說的話,我只想要知道他們最喜歡的第一個活動是在做什麼的話,該怎麼處理呢?
這時候就會需要用到combo 組合技囉!
可以使用split().getItem()
達到這樣的效果
可以藉由調整getItem(num)
就可以選擇你想要的位置!
rdd = sc.parallelize(
[
("drink,dance", 2, "Carmen",23,'Female'),
("movie,music", 2, "Juliette",16,'Female'),
("write,learn", 2, "Don José",25,'Male'),
("sleep,hangout", 2, "Escamillo",30,'Male'),
("play,dance", 2, "Roméo",18,'Male')
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
## 我只要一個,所以得到的就會是一個字串,不需要額外再轉換型態(change DataType)
df.select(split(df['Thing'],',').getItem(0).alias('sperate_split_single')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
| Thing|Hour| Name|Age|Gender|
+-------+----+---------+---+------+
| drink| 2| Carmen| 23|Female|
| movie| 2| Juliette| 16|Female|
|writing| 2| Don José| 25| Male|
| sleep| 2|Escamillo| 30| Male|
| play| 2| Roméo| 18| Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(split(df['Thing'],',').getItem(0).alias('sperate_split_single')).show()
root
|-- sperate_split_single: string (nullable = true)
+--------------------+
|sperate_split_single|
+--------------------+
| drink|
| movie|
| write|
| sleep|
| play|
+--------------------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
當然getItem()的作用其實不只這些,這個有空我們再來好好介紹吧~
substr()
如果你跟我一樣做Data Engineer 的入行也是SQL的話,想必對她一定也不陌生吧
對,就是你當年又愛又恨,很常拿來做where
條件,避免使用like
來提高效率的東西啊substr()
: 也就是substring()
的簡寫substr(start-postition, lenght)
start-postition
: 顧名思義,這個就是開始的位置,你要從這個字串的第幾個位置開始lenght
: 這個也就是,你要切幾個位置
情境說明:當你有明確且固定的字串可切的時候可以使用,比如說你要判斷這個欄位是不是真的是塞網址時
你就可以使用substr(1,5)=='https'
rdd = sc.parallelize(
[
("drink,dance,play", 2, "Carmen",23,'Female'),
("movie,music", 2, "Juliette",16,'Female'),
("write,learn", 2, "Don José",25,'Male'),
("sleep,hangout", 2, "Escamillo",30,'Male'),
("play,dance", 2, "Roméo",18,'Male')
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
df.select(split(df['Thing'],',').getItem(0).alias('sperate_split_single')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+----------------+----+---------+---+------+
| Thing|Hour| Name|Age|Gender|
+----------------+----+---------+---+------+
|drink,dance,play| 2| Carmen| 23|Female|
| movie,music| 2| Juliette| 16|Female|
| write,learn| 2| Don José| 25| Male|
| sleep,hangout| 2|Escamillo| 30| Male|
| play,dance| 2| Roméo| 18| Male|
+----------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
## 可以看到,因為你是用字串的位置去切的,所以在最後一欄則會多切到一個逗點
df.select(col('Thing').substr(1,5).alias('sperate_substr')).show()
root
|-- sperate_split_single: string (nullable = true)
+--------------+
|sperate_substr|
+--------------+
| drink|
| movie|
| write|
| sleep|
| play,|
+--------------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
希望今天的內容可以對大家有幫助,畢竟在切字串的辛酸血淚史,可是那些分析師們不懂,只有Data Engineer深受其害的過程啊
如果有任何不理解、錯誤或其他方法想分享的話,歡迎留言給我!喜歡的話,也歡迎按讚訂閱!
我是 Vivi,一位在雲端掙扎的資料工程師!我們下一篇文章見!Bye Bye~
【本篇文章將同步更新於個人的 Medium,期待與您的相遇!】